knitr::opts_chunk$set(warning=FALSE,message=FALSE,error=FALSE)

Handwork

My policy on handwork is that it is required for Statistics PhD students and MS in Statistics students. It is encouraged (but not required) for MS in Applied Statistics and all other students. (If you are in the latter groups, you will not be penalized for getting these wrong.)

Exercises from the book: 17.1, 17.2, 18.3, 18.5

You can type your answers and submit within this file or you can do this work on paper and submit a scanned copy (be sure it is legible).

17.1

17.2

18.3

Answer:

From the plot, the local-linear regression estimator is less biased: the gap between its curve and the true regression line is smaller.

Compared with kernel smoothing, the local-linear estimator also performs better at the boundaries. Near a boundary the kernel smoother averages over a one-sided, locally weighted neighborhood, so it is pulled toward the surrounding large (or small) values; the local-linear fit largely corrects this boundary bias.

library(stats)
set.seed(2)
#simulated data
x<-runif(100, 0, 100)
e<-rnorm(100, 0, 20)
y<-100-5*(x/10-5)^2+(x/10-5)^3+e


f<-function(x){
  y<-100-5*(x/10-5)^2+(x/10-5)^3
  return(y)
}

x.seq<-seq(min(x), max(x), length.out = 100)
# true line
Ey<-100-5*(x.seq/10-5)^2+(x.seq/10-5)^3
# local linear
fit.18.ll<-loess(y~x, degree = 1, span = 0.4)

y.ll.pred<-predict(fit.18.ll, data.frame(x=x.seq))


plot(y~x)
lines(x.seq, Ey, lwd=2)
lines(x.seq, y.ll.pred, col="red", lty=2, lwd=2)
#kernel regression/kernel smoothing
lines(ksmooth(x, y, "normal", bandwidth = 18), col="green", lty=3, lwd=3)

legend(67,-50, legend = c("true regression", "local-linear", "kernel smoothing"), col = c("black", "red", "green"), lty=c(1,2,3), cex=1 )
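The boundary-bias claim can also be checked numerically. Below is a sketch that reuses the same simulated setup (same seed, bandwidth, and span as above) and compares both estimators with the true regression function at the right edge of the data:

```r
# Sketch: compare kernel smoothing and local-linear regression with the
# true regression function at the right boundary of the data.
set.seed(2)
x <- runif(100, 0, 100)
y <- 100 - 5*(x/10 - 5)^2 + (x/10 - 5)^3 + rnorm(100, 0, 20)
f <- function(x) 100 - 5*(x/10 - 5)^2 + (x/10 - 5)^3

x0 <- max(x)  # evaluate at the boundary, where one-sided weighting hurts most
est.ks <- ksmooth(x, y, "normal", bandwidth = 18, x.points = x0)$y
est.ll <- predict(loess(y ~ x, degree = 1, span = 0.4), data.frame(x = x0))
c(truth = f(x0), kernel = est.ks, local.linear = est.ll)
```

Because the true function is increasing near the right edge, the one-sided kernel average is pulled down by points to the left of \(x_0\); the local-linear estimate corrects most of that bias.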

18.5

Answer:

Span = 0.3 leads to the smallest ASE.

This roughly confirms my visual selection in 18.3: as the span increases, the fitted line first fits better and then worse, which matches the pattern in the ASE plot. But span = 0.3 makes the fitted line a little too wiggly, so I still choose span = 0.4; the ASE barely differs.

ss<-seq(0.05, 0.95, by=0.05)
n<-length(ss)
n.y<-length(y)
df.ase<-data.frame(span=ss, ASE=rep(0,n))
Ey0<-100-5*(x/10-5)^2+(x/10-5)^3

for (i in 1:n) {
  sp<-ss[i]
  fit<-loess(y~x, degree = 1, span = sp)
  y.hat<-predict(fit)
  df.ase[i,2]<-sum((y.hat-Ey0)^2)/n.y
  
}
with(df.ase, plot(ASE~span))

df.ase[which.min(df.ase$ASE),1]
## [1] 0.3

Data analysis

1. Exercise D17.1

The data in ginzberg.txt (collected by Ginzberg) were analyzed by Monette (1990). The data are for a group of 82 psychiatric patients hospitalized for depression. The response variable in the data set is the patient’s score on the Beck scale, a widely used measure of depression. The explanatory variables are “simplicity” (measuring the degree to which the patient “sees the world in black and white”) and “fatalism”. (These three variables have been adjusted for other explanatory variables that can influence depression.)

Using the full quadratic regression model \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 + \epsilon\), regress the Beck-scale scores on simplicity and fatalism.

  1. Are the quadratic and product terms needed here?

Answer:

Based on the scatterplots, there does not seem to be a quadratic relationship between depression and fatalism. There may be a nonlinear relationship between depression and simplicity, but it does not look quadratic either.

Based on the ANOVA results, the interaction term is needed here, but the two quadratic terms do not appear to be.

P.S. Is it possible for a model to include higher-order terms without the corresponding lower-order ones?
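On the P.S.: usually not. By the principle of marginality, a model containing \(X^2\) (or \(X_1X_2\)) should retain the corresponding lower-order terms. One reason is that the reduced model is not invariant to a shift of the origin of \(X\); a sketch on hypothetical simulated data (not the Ginzberg data):

```r
# Marginality sketch: drop the linear term and the fit depends on where
# the origin of x sits; keep it and the fit is shift-invariant.
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5*x + 0.3*x^2 + rnorm(50)

fit.full    <- lm(y ~ x + I(x^2))               # respects marginality
fit.full.sh <- lm(y ~ I(x - 5) + I((x - 5)^2))  # same model after shifting x
fit.sq      <- lm(y ~ I(x^2))                   # violates marginality
fit.sq.sh   <- lm(y ~ I((x - 5)^2))             # "same" model after shifting x

max(abs(fitted(fit.full) - fitted(fit.full.sh)))  # essentially zero
max(abs(fitted(fit.sq)   - fitted(fit.sq.sh)))    # clearly nonzero
```

The full quadratic spans the same column space before and after the shift, so its fitted values are unchanged; the quadratic-only model's fitted values depend on where zero happens to be.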

# polynomial regression
library(car)
df.g<-Ginzberg
# check the relationship intuitively
scatterplot(adjdep~adjsimp, data=df.g)

scatterplot(adjdep~adjfatal, data=df.g)

fit.0<-lm(adjdep~adjsimp+adjfatal, data = df.g)
fit.1<-lm(adjdep~adjsimp+adjfatal+adjfatal*adjsimp+I(adjsimp^2)+I(adjfatal^2), data = df.g)
fit.1.1<-lm(adjdep~adjsimp+adjfatal+adjfatal*adjsimp, data = df.g)
fit.1.2<-lm(adjdep~adjsimp+adjfatal+I(adjsimp^2), data = df.g)
fit.1.3<-lm(adjdep~adjsimp+adjfatal+I(adjfatal^2), data = df.g)
#brief(fit.1) #brief summary of fit model
summary(fit.1)
## 
## Call:
## lm(formula = adjdep ~ adjsimp + adjfatal + adjfatal * adjsimp + 
##     I(adjsimp^2) + I(adjfatal^2), data = df.g)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.63090 -0.23619 -0.06667  0.21295  1.04481 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      -0.10865    0.20687  -0.525   0.6010   
## adjsimp           0.82039    0.29702   2.762   0.0072 **
## adjfatal          0.55247    0.28391   1.946   0.0554 . 
## I(adjsimp^2)      0.08172    0.15374   0.532   0.5966   
## I(adjfatal^2)     0.20927    0.18663   1.121   0.2657   
## adjsimp:adjfatal -0.55366    0.27855  -1.988   0.0505 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3738 on 76 degrees of freedom
## Multiple R-squared:  0.4755, Adjusted R-squared:  0.441 
## F-statistic: 13.78 on 5 and 76 DF,  p-value: 1.416e-09
anova(fit.0, fit.1)
## Analysis of Variance Table
## 
## Model 1: adjdep ~ adjsimp + adjfatal
## Model 2: adjdep ~ adjsimp + adjfatal + adjfatal * adjsimp + I(adjsimp^2) + 
##     I(adjfatal^2)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     79 11.478                           
## 2     76 10.621  3   0.85711 2.0444 0.1147
anova(fit.0, fit.1.1)
## Analysis of Variance Table
## 
## Model 1: adjdep ~ adjsimp + adjfatal
## Model 2: adjdep ~ adjsimp + adjfatal + adjfatal * adjsimp
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     79 11.478                              
## 2     78 10.798  1   0.67981 4.9106 0.02961 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit.0, fit.1.2)
## Analysis of Variance Table
## 
## Model 1: adjdep ~ adjsimp + adjfatal
## Model 2: adjdep ~ adjsimp + adjfatal + I(adjsimp^2)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     79 11.478                           
## 2     78 11.198  1   0.28013 1.9513 0.1664
anova(fit.0, fit.1.3)
## Analysis of Variance Table
## 
## Model 1: adjdep ~ adjsimp + adjfatal
## Model 2: adjdep ~ adjsimp + adjfatal + I(adjfatal^2)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     79 11.478                           
## 2     78 11.397  1  0.081148 0.5554 0.4584
fit.2<-fit.1.1

  2. If you have access to suitable software, graph the data and the fitted regression surface in three dimensions. Do you see any problems with the data?

Answer: from the plot, we can see that there are some outliers, as well as one obvious high-leverage point. Consequently, there may also be some influential points.

# 3-dimensional plot; plane
library("plot3D")
x<-df.g$adjsimp
y<-df.g$adjfatal
z<-df.g$adjdep

fitpoints<-fit.2$fitted.values
grid.lines<-10
x.pred<-seq(min(df.g$adjsimp),max(df.g$adjsimp), length.out = grid.lines)
y.pred<-seq(min(df.g$adjfatal),max(df.g$adjfatal), length.out = grid.lines)
xy <- expand.grid(adjsimp = x.pred, adjfatal = y.pred) # create grids of x-y
z.pred <- matrix(predict(fit.2, newdata = xy), nrow = grid.lines, ncol = grid.lines) # predicted values at every grid point of xy

scatter3D(x, y, z, pch = 19, cex = 1, colvar = NULL, col="red", 
          theta = 20, phi = 5, bty="b",
          xlab = "simplicity", ylab = "fatalism", zlab = "depression",  
          surf = list(x = x.pred, y = y.pred, z = z.pred,  
                      facets = TRUE, fit = fitpoints, col=ramp.col (col = c("dodgerblue3","seagreen2"), n = 300, alpha=0.9), border="black"), main = "depression regression")

  3. What do standard regression diagnostics for influential observations show?

Answer:

Based on Cook’s distance, observations 71, 65, and 80 are influential. From the added-variable plots, observation 65 is influential for all three terms, while 80 mainly affects the “simplicity” term. It is hard to tell how the fitted lines would change if observation 71 were removed.

influenceIndexPlot(fit.2)

influencePlot(fit.2, id.method="identify")

##       StudRes        Hat      CookD
## 65 -0.7830169 0.61224209 0.24322281
## 70  2.7416683 0.01504210 0.02648577
## 71  3.7104658 0.09778806 0.32058125
## 80  1.7379439 0.22761790 0.21690988
avPlots(fit.2)
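For reference, Cook’s distance (the statistic behind influenceIndexPlot and influencePlot) can be computed by hand from leverages and standardized residuals. A sketch on simulated data (not the Ginzberg data) with one planted influential point:

```r
# Sketch: Cook's distance computed by hand matches stats::cooks.distance();
# a planted high-leverage outlier shows up as the most influential point.
set.seed(3)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
x[30] <- 5; y[30] <- -10        # plant a high-leverage outlier
fit <- lm(y ~ x)

h <- hatvalues(fit)             # leverages
r <- rstandard(fit)             # standardized residuals
p <- length(coef(fit))          # number of coefficients
D <- r^2 * h / (p * (1 - h))    # Cook's distance formula

max(abs(D - cooks.distance(fit)))  # essentially zero
which.max(D)                       # the planted point has the largest D
```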

2. Exercise D18.2

For this analysis, use the States.txt data, which includes average SAT scores for each state as the outcome.

  1. Put together a model with SAT math (SATM) as the outcome and region, pop, percent, dollars, and pay as the explanatory variables, each included as linear terms. Interpret the findings.

Answer:

From the summary of this purely linear model, we can see that only “region” and “percent” are statistically significant. The \(R^2\) shows that the model explains a large share (around 85%) of the variance of the response (SAT math), so it looks like a good fit if we consider \(R^2\) alone. However, the component-plus-residual plots (crPlots) add more information.

The partial relationship between the SAT math score and “percent” is almost perfectly linear, although there seems to be a gap in the middle of the “percent” range. The partial relationships between the response and “pop”, “dollars”, and “pay” look nonlinear, i.e., certain patterns are visible (least obviously for “pay”). In other words, these variables may still be related to the response, and we should not simply drop them before further checking.

Thus, to figure out the effects of covariates other than “region” and “percent”, the data need to be refitted with nonlinear terms, e.g., nonparametric smooth terms.
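To make the crPlots discussion concrete: each crPlot panel plots the partial residuals \(e_i + \hat\beta_j x_{ij}\) against \(x_{ij}\), so visible curvature signals a nonlinear partial relationship. A sketch on simulated data (not the States data) where one predictor truly enters nonlinearly:

```r
# Sketch: component-plus-residual values computed by hand; curvature in
# the x2 panel reveals the quadratic term that the linear fit missed.
set.seed(5)
x1 <- runif(60)
x2 <- runif(60, -1, 1)
y  <- 1 + 2*x1 + 3*x2^2 + rnorm(60, 0, 0.2)   # x2 enters nonlinearly
fit <- lm(y ~ x1 + x2)

pr2 <- residuals(fit) + coef(fit)["x2"] * x2  # partial residuals for x2
plot(x2, pr2, main = "hand-made crPlot for x2")
lines(lowess(x2, pr2), col = "red")           # traces a parabola-like curve
```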

df.s<-States
fit.lm<-lm(SATM~region+pop+percent+dollars+pay, data = df.s)
fit.lm.1<-lm(SATM~region+percent, data = df.s)
summary(fit.lm)
## 
## Call:
## lm(formula = SATM ~ region + pop + percent + dollars + pay, data = df.s)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.4293  -8.4650  -0.6388   9.9749  23.5808 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.158e+02  1.939e+01  26.603  < 2e-16 ***
## regionESC   -2.811e+00  1.019e+01  -0.276  0.78419    
## regionMA     1.215e+01  1.539e+01   0.789  0.43493    
## regionMTN    1.305e+00  8.681e+00   0.150  0.88126    
## regionNE     1.386e+01  1.270e+01   1.091  0.28197    
## regionPAC    2.198e+00  9.715e+00   0.226  0.82219    
## regionSA    -1.191e+01  1.021e+01  -1.166  0.25083    
## regionWNC    2.725e+01  8.931e+00   3.051  0.00414 ** 
## regionWSC   -7.939e+00  1.011e+01  -0.785  0.43721    
## pop         -2.032e-04  5.037e-04  -0.403  0.68891    
## percent     -1.282e+00  2.099e-01  -6.109 4.04e-07 ***
## dollars      1.355e+00  4.062e+00   0.334  0.74053    
## pay          4.951e-01  9.754e-01   0.508  0.61469    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.41 on 38 degrees of freedom
## Multiple R-squared:  0.8856, Adjusted R-squared:  0.8495 
## F-statistic: 24.52 on 12 and 38 DF,  p-value: 3.062e-14
anova(fit.lm.1, fit.lm)
## Analysis of Variance Table
## 
## Model 1: SATM ~ region + percent
## Model 2: SATM ~ region + pop + percent + dollars + pay
##   Res.Df  RSS Df Sum of Sq      F Pr(>F)
## 1     41 7188                           
## 2     38 6833  3    354.92 0.6579  0.583
crPlots(fit.lm)

  2. Now, instead, approach building this model using the nonparametric-regression methods of this chapter. Fit a general nonparametric regression model and an additive-regression model, comparing the results to each other and to the linear least-squares fit to the data (in (a)).

Answer:

The general nonparametric regression model and the additive-regression model both show that, among the numeric variables, “percent” is the only statistically significant one.

The additive-regression model (a semiparametric version, with a parametric part for “region”) gives the same result as the linear model: only “region” (regionWNC) and “percent” are statistically significant. From the plots of the smooth terms, “pay” and “pop” show essentially no relationship with SATM, while “dollars” shows some relationship, albeit with large variance.

It seems that population, public-education spending, and teacher pay are unrelated to students’ SAT math scores, or at least that any relationship cannot be captured by the local-polynomial and additive-regression models.

library(mgcv)
library(stats)
# general nonparametric regression model (I take this to mean Y-hat = f(x1, x2, ..., xp))
# local polynomial (loess allows at most 4 predictors)
fit.lp.0<-loess(SATM~pop+percent+dollars+pay, degree = 1, data = df.s) # degree: order of the local polynomial
#summary(fit.lp.0) # summary() is not very informative for a multi-predictor loess
# drop one predictor at a time to find the significant terms
fit.lp.1<-loess(SATM~percent+dollars+pay, degree = 1, data = df.s)
fit.lp.2<-loess(SATM~pop+dollars+pay, degree = 1, data = df.s)
fit.lp.3<-loess(SATM~pop+percent+pay, degree = 1, data = df.s)
fit.lp.4<-loess(SATM~pop+percent+dollars, degree = 1, data = df.s)
anova(fit.lp.0, fit.lp.1)
## Model 1: loess(formula = SATM ~ pop + percent + dollars + pay, data = df.s, degree = 1)
## Model 2: loess(formula = SATM ~ percent + dollars + pay, data = df.s, degree = 1)
## 
## Analysis of Variance:   denominator df 39.26
## 
##       ENP   RSS F-value Pr(>F)
## [1,] 8.32 11259               
## [2,] 6.55 10143 0.73051 0.5599
anova(fit.lp.0, fit.lp.2)#percent->significant
## Model 1: loess(formula = SATM ~ pop + percent + dollars + pay, data = df.s, degree = 1)
## Model 2: loess(formula = SATM ~ pop + dollars + pay, data = df.s, degree = 1)
## 
## Analysis of Variance:   denominator df 39.26
## 
##       ENP   RSS F-value   Pr(>F)    
## [1,] 8.32 11259                     
## [2,] 6.93 36659  20.169 7.38e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit.lp.0, fit.lp.3)
## Model 1: loess(formula = SATM ~ pop + percent + dollars + pay, data = df.s, degree = 1)
## Model 2: loess(formula = SATM ~ pop + percent + pay, data = df.s, degree = 1)
## 
## Analysis of Variance:   denominator df 39.26
## 
##       ENP     RSS F-value Pr(>F)
## [1,] 8.32 11259.4               
## [2,] 7.21  9751.1  1.4131 0.2556
anova(fit.lp.0, fit.lp.4)
## Model 1: loess(formula = SATM ~ pop + percent + dollars + pay, data = df.s, degree = 1)
## Model 2: loess(formula = SATM ~ pop + percent + dollars, data = df.s, degree = 1)
## 
## Analysis of Variance:   denominator df 39.26
## 
##       ENP   RSS F-value Pr(>F)
## [1,] 8.32 11259               
## [2,] 6.94 10016 0.98897 0.4039
fit.lp<-loess(SATM~percent, degree = 1, data = df.s)
#plot fitted model
plot(SATM~percent, data=df.s)
ss<-with(df.s, seq(min(percent), max(percent), length.out = 200))
satm.pred<-predict(fit.lp, data.frame(percent=ss))
lines(ss, satm.pred, col="red")
lines(with(df.s, smooth.spline(percent, SATM, df=3.85)), col="green") #smooth splines

#fit.gen<-gam(SATM~s(percent), data = df.s)
# additive-regression model (semiparametric, in order to include "region")
fit.add<-gam(SATM~region+s(pop)+s(percent)+s(dollars)+s(pay), data = df.s)
summary(fit.add)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## SATM ~ region + s(pop) + s(percent) + s(dollars) + s(pay)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 492.5039     6.1674  79.856  < 2e-16 ***
## regionESC     2.8933     9.5428   0.303  0.76364    
## regionMA     -4.1707    14.7132  -0.283  0.77858    
## regionMTN    13.7467     8.1271   1.691  0.10014    
## regionNE     -2.2987    12.0695  -0.190  0.85011    
## regionPAC    12.6397     8.8165   1.434  0.16105    
## regionSA    -12.6855     8.5796  -1.479  0.14871    
## regionWNC    29.5165     8.1956   3.601  0.00102 ** 
## regionWSC    -0.3958     9.6652  -0.041  0.96758    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df      F p-value    
## s(pop)     1.000  1.000  0.143   0.708    
## s(percent) 2.633  3.355 18.879 2.1e-08 ***
## s(dollars) 4.263  5.295  1.780   0.124    
## s(pay)     1.000  1.000  0.149   0.702    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.901   Deviance explained = 93.5%
## GCV = 182.01  Scale est. = 118.15    n = 51
plot(fit.add)

  3. Can you handle the nonlinearity by a transformation or by another parametric regression model, such as a polynomial regression? Investigate and explain. What are the tradeoffs between these nonparametric and parametric approaches?

Answer:

  • Handle Nonlinearity

I tried a log transformation of the response, and it does not seem to improve the fit much.

A log transformation of “pop” seems to make it uncorrelated with the response: its component-plus-residual plot shows an essentially random pattern around y = 0.

As for “dollars” and “pay”, neither log transformations nor polynomial terms seem to improve the fit; their relationships with SAT math cannot be captured by simple parametric forms, perhaps because they are too complex. More likely, given the original crPlots and the nonparametric results, they simply have no real relationship with SAT math.

  • Trade-offs

Interpretation: compared with parametric approaches, nonparametric methods are more flexible and fit the data better, but they are harder to interpret, since the fitted smooths are usually complicated.

Multiple predictors: general nonparametric models suffer from the curse of dimensionality, and additive-regression models are restricted to separate (additive) effects for each explanatory variable, whereas parametric models can represent both separate and interaction effects compactly.

Generalization: because nonparametric models lean more heavily on the data, they are prone to overfitting and may not generalize easily to other datasets. Conversely, although parametric models may fit the data at hand less well because of their simpler structure, they often transfer better.

p.s. semi-parametric methods seem to be the best……
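The generalization point can be sketched with a small train/test experiment (simulated data; the span value is illustrative): a very flexible smoother’s training error understates its error on new data.

```r
# Sketch: in-sample vs out-of-sample error for a very wiggly smoother.
set.seed(4)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, 0, 0.5)
tr <- 1:100; te <- 101:200
d.tr <- data.frame(x = x[tr], y = y[tr])
d.te <- data.frame(x = x[te], y = y[te])

# surface = "direct" lets loess predict slightly outside the training range
fit.wiggly <- loess(y ~ x, data = d.tr, span = 0.1, degree = 1,
                    control = loess.control(surface = "direct"))
mse <- function(fit, d) mean((d$y - predict(fit, newdata = d))^2)
c(train = mse(fit.wiggly, d.tr), test = mse(fit.wiggly, d.te))
```

The gap between train and test MSE is the overfitting penalty; a larger span (or a parametric fit) narrows that gap at the cost of in-sample fit.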

fit.lm.tr1<-lm(log(SATM)~region+pop+percent+dollars+pay, data = df.s)
fit.lm.tr3<-lm(SATM~region+poly(pop,2, raw=TRUE)+percent+poly(dollars, 5, raw = TRUE)+poly(pay, 5, raw = TRUE), data = df.s)

summary(fit.lm.tr1)
## 
## Call:
## lm(formula = log(SATM) ~ region + pop + percent + dollars + pay, 
##     data = df.s)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.044595 -0.017204  0.000329  0.019727  0.049866 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.241e+00  3.890e-02 160.455  < 2e-16 ***
## regionESC   -4.775e-03  2.045e-02  -0.233   0.8166    
## regionMA     2.675e-02  3.088e-02   0.866   0.3918    
## regionMTN    3.987e-03  1.742e-02   0.229   0.8201    
## regionNE     3.035e-02  2.548e-02   1.191   0.2410    
## regionPAC    6.615e-03  1.949e-02   0.339   0.7362    
## regionSA    -2.410e-02  2.049e-02  -1.176   0.2469    
## regionWNC    5.120e-02  1.792e-02   2.857   0.0069 ** 
## regionWSC   -1.463e-02  2.029e-02  -0.721   0.4751    
## pop         -4.105e-07  1.011e-06  -0.406   0.6869    
## percent     -2.625e-03  4.211e-04  -6.234 2.72e-07 ***
## dollars      2.337e-03  8.149e-03   0.287   0.7758    
## pay          1.189e-03  1.957e-03   0.607   0.5472    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0269 on 38 degrees of freedom
## Multiple R-squared:  0.8852, Adjusted R-squared:  0.8489 
## F-statistic: 24.41 on 12 and 38 DF,  p-value: 3.295e-14
summary(fit.lm.tr3)
## 
## Call:
## lm(formula = SATM ~ region + poly(pop, 2, raw = TRUE) + percent + 
##     poly(dollars, 5, raw = TRUE) + poly(pay, 5, raw = TRUE), 
##     data = df.s)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.3692  -8.4480  -0.4978   8.4389  22.5876 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -8.989e+03  1.865e+04  -0.482 0.633451    
## regionESC                      2.182e+01  1.469e+01   1.485 0.148340    
## regionMA                       6.983e+00  1.817e+01   0.384 0.703484    
## regionMTN                      1.268e+01  1.164e+01   1.089 0.284953    
## regionNE                       1.272e+01  1.629e+01   0.781 0.441098    
## regionPAC                      2.338e+00  1.298e+01   0.180 0.858340    
## regionSA                      -9.261e+00  1.210e+01  -0.766 0.450072    
## regionWNC                      3.920e+01  1.109e+01   3.535 0.001390 ** 
## regionWSC                      8.404e+00  1.315e+01   0.639 0.527892    
## poly(pop, 2, raw = TRUE)1     -4.518e-05  1.631e-03  -0.028 0.978092    
## poly(pop, 2, raw = TRUE)2     -2.132e-09  5.973e-08  -0.036 0.971767    
## percent                       -1.117e+00  2.539e-01  -4.397 0.000135 ***
## poly(dollars, 5, raw = TRUE)1 -1.782e+03  1.491e+03  -1.195 0.241798    
## poly(dollars, 5, raw = TRUE)2  6.141e+02  5.494e+02   1.118 0.272872    
## poly(dollars, 5, raw = TRUE)3 -1.003e+02  9.777e+01  -1.026 0.313440    
## poly(dollars, 5, raw = TRUE)4  7.811e+00  8.418e+00   0.928 0.361150    
## poly(dollars, 5, raw = TRUE)5 -2.333e-01  2.813e-01  -0.829 0.413755    
## poly(pay, 5, raw = TRUE)1      1.950e+03  3.071e+03   0.635 0.530411    
## poly(pay, 5, raw = TRUE)2     -1.297e+02  1.980e+02  -0.655 0.517555    
## poly(pay, 5, raw = TRUE)3      4.222e+00  6.300e+00   0.670 0.508093    
## poly(pay, 5, raw = TRUE)4     -6.740e-02  9.900e-02  -0.681 0.501428    
## poly(pay, 5, raw = TRUE)5      4.229e-04  6.146e-04   0.688 0.496849    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.65 on 29 degrees of freedom
## Multiple R-squared:  0.9095, Adjusted R-squared:  0.844 
## F-statistic: 13.88 on 21 and 29 DF,  p-value: 5.242e-10
crPlots(fit.lm.tr3)

3. Exercise D18.3

Return to the Chile.txt dataset used in HW2. Reanalyze the data employing generalized nonparametric regression (including generalized additive) models.

  1. What, if anything, do you learn about the data from the nonparametric regression?

Answer:

From the several ANOVA results, we can see that age, statusquo, and education are the three statistically significant variables. “statusquo” is the only one with a nonlinear relationship with the response “vote” (roughly polynomial, close to quadratic in the middle of its range), though near the boundaries the variance is large.

library(arm)
library(dplyr)
df.c<-Chile %>%
  na.omit()
fit.gam.0<-gam(vote~region+as.factor(population)+sex+s(age)+education+income+s(statusquo), family = binomial, data=df.c) #age, statusquo, education are significant

fit.gam.1<-gam(vote~s(age)+education+s(statusquo), family = binomial, data=df.c)
anova(fit.gam.1, fit.gam.0, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: vote ~ s(age) + education + s(statusquo)
## Model 2: vote ~ region + as.factor(population) + sex + s(age) + education + 
##     income + s(statusquo)
##   Resid. Df Resid. Dev     Df Deviance Pr(>Chi)
## 1    2422.5     1186.1                         
## 2    2408.2     1168.8 14.351   17.344   0.2589
fit.gam.2<-gam(vote~age+education+s(statusquo), family = binomial, data=df.c)
anova(fit.gam.2, fit.gam.1, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: vote ~ age + education + s(statusquo)
## Model 2: vote ~ s(age) + education + s(statusquo)
##   Resid. Df Resid. Dev     Df Deviance Pr(>Chi)
## 1    2423.6     1187.6                         
## 2    2422.5     1186.1 1.0579   1.4911   0.2368
fit.gam.3<-gam(vote~s(age)+education+statusquo, family = binomial, data=df.c)
anova(fit.gam.3, fit.gam.1, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: vote ~ s(age) + education + statusquo
## Model 2: vote ~ s(age) + education + s(statusquo)
##   Resid. Df Resid. Dev     Df Deviance  Pr(>Chi)    
## 1    2424.7     1233.6                              
## 2    2422.5     1186.1 2.1675   47.505 6.601e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit.gam<-fit.gam.2
summary(fit.gam)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## vote ~ age + education + s(statusquo)
## 
## Parametric coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.334742   0.302930   7.707 1.29e-14 ***
## age          0.017437   0.006382   2.732  0.00629 ** 
## educationPS -0.211363   0.250555  -0.844  0.39890    
## educationS  -0.486749   0.193130  -2.520  0.01173 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df Chi.sq  p-value    
## s(statusquo) 2.759  3.435   41.2 2.01e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0293   Deviance explained = 6.36%
## UBRE = -0.50591  Scale est. = 1         n = 2431
plot(fit.gam)

  2. If the results appear to be substantially nonlinear, can you deal with the nonlinearity in a suitably respecified generalized linear model (e.g., by transforming one or more explanatory variables)?

Answer:

Yes, by adding higher-order terms of “statusquo”. With a third-degree polynomial, the ANOVA shows that this model performs at least as well as the nonparametric model (slightly lower residual deviance while using fewer parameters). Thus, the nonlinearity can be captured parametrically.
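The same idea can be checked on simulated data (a sketch, not the Chile data): when the true logit is a cubic, a polynomial logistic regression recovers it and beats the linear-logit model on AIC.

```r
# Sketch: a cubic term in a logistic regression captures a smooth,
# nonlinear logit that a linear-logit model misses.
set.seed(6)
n <- 500
s <- rnorm(n)                      # plays the role of "statusquo"
p <- plogis(0.5 + s - 0.8 * s^3)   # true logit is cubic in s
v <- rbinom(n, 1, p)

fit.lin <- glm(v ~ s,          family = binomial)
fit.cub <- glm(v ~ poly(s, 3), family = binomial)
AIC(fit.lin, fit.cub)              # the cubic model should win
```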

plot(vote~statusquo, data = df.c)

fit.glm<-gam(vote~age+education+poly(statusquo, 3, raw = TRUE), family = binomial, data=df.c)
anova(fit.glm, fit.gam, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: vote ~ age + education + poly(statusquo, 3, raw = TRUE)
## Model 2: vote ~ age + education + s(statusquo)
##   Resid. Df Resid. Dev      Df Deviance Pr(>Chi)
## 1    2424.0     1186.5                          
## 2    2423.6     1187.6 0.43452   -1.082

4. Exercise E18.7

For this analysis, use the Duncan.txt data. Here we are interested in the outcome prestige and the explanatory variable income.

  1. Fit the local-linear regression of prestige on income with span \(s = 0.6\) (see Figure 18.7 in the book). This has 5.006 equivalent degrees of freedom, very close to the number of degrees of freedom for a fourth-order polynomial.

library(car)
df.d<-Duncan
fit.ll<-loess(prestige ~ income, data = df.d, span = 0.6, degree = 1)
summary(fit.ll)
## Call:
## loess(formula = prestige ~ income, data = df.d, span = 0.6, degree = 1)
## 
## Number of Observations: 45 
## Equivalent Number of Parameters: 3.12 
## Residual Standard Error: 17.53 
## Trace of smoother matrix: 3.61  (exact)
## 
## Control settings:
##   span     :  0.6 
##   degree   :  1 
##   family   :  gaussian
##   surface  :  interpolate      cell = 0.2
##   normalize:  TRUE
##  parametric:  FALSE
## drop.square:  FALSE

Note: summary() above reports about 3.12 equivalent parameters rather than the exercise’s 5.006; loess’s “Equivalent Number of Parameters” is computed differently from the equivalent degrees of freedom used in the book’s Figure 18.7.

  2. Fit a fourth-order polynomial to the data and compare the resulting regression curve with the local-linear regression.

Answer:

From the plot, we can see that the two fitted curves are similar until “income” exceeds about 60. The fourth-order polynomial regression is more “curved” at the right boundary.

fit.p4<-lm(prestige ~ poly(income, 4, raw = TRUE), data = df.d)
inc.200<-with(df.d, seq(min(income), max(income), length.out = 200))
pred.ll<-predict(fit.ll, data.frame(income=inc.200))
pred.p4<-predict(fit.p4, data.frame(income=inc.200))
#anova(fit.ll, fit.p4, test="Chisq")

plot(prestige~income, data = df.d)
lines(inc.200, pred.ll, lty=1, lwd=2, col="red")
lines(inc.200, pred.p4, lty=2, lwd=2, col="green")
legend(5, 98, legend=c("local-linear regression", "4th order polynomial"),
       col=c("red","green"), lty=1:2, cex=1)